perm filename TALK[KI,ALS] blob
sn#094478 filedate 1974-03-28 generic text, type T, neo UTF8
This is in the nature of a progress report rather than any sort of
finished paper describing a completed piece of work. We at Stanford
have felt that some work was needed on what might be called the front
end of a speech understanding system, to bring into balance the overall
effort on speech understanding that is currently being sponsored by
ARPA.

In the early 60's it was quite fashionable to investigate speech on a
pitch-synchronous basis, this in spite of the fact that the facilities
for doing so were quite primitive as compared with those that we have
today. The work of Mathews, Miller and David can be cited as an example.
The computational complexities of this approach using Fourier analysis
led some to attempt to obtain similar results by direct analysis in the
time domain. The work of Pinson can be referenced as an example. This
led to the development of a variety of so-called LPC methods, which more
recently have been shown to be essentially equivalent. With the current
availability of specialized fast auxiliary hardware, the problem of
doing Fourier transforms no longer seems to pose the difficulty that it
once did, and it seems desirable to once again go back to the general
methods of Mathews, Miller and David. Feeling that this is currently a
neglected phase, we have devoted considerable effort to it.

A few remarks regarding the long-term effort in speech recognition will,
I think, restore a sense of perspective. When the current ARPA projects
were started, there had been a long history of continued study of the
mechanisms of speech production and recognition by such organizations
as the Bell Telephone Laboratories and the Haskins Laboratories, to name
but two of a long string of organizations which have done and continue
to do good work. Some of this work is far from new. I well remember the
state of the art in 1928 when I joined the Bell Laboratories and became
acquainted with Harvey Fletcher and his associates. In spite of this
long-continued effort at basic understanding, as of roughly three years
ago the practical results in terms of operating speech recognition
systems were essentially nil. Oh, it was true that there had been many
demonstrations of word recognition, perhaps the most successful one
being that by Reddy and Vicens at Stanford, which incidentally was
supported by ARPA. Nevertheless it had become apparent that a
continuation of this same brute-force attack on speech recognition
without understanding was more or less a blind alley, and that a rather
drastic infusion of new ideas, and incidentally of new money, was needed
if the machine recognition of continuous speech was ever to become a
reality.

As you all know, several major projects were initiated as of that time,
and many of these have been or are to be reported at this conference.
With this massive infusion of new talent elsewhere, and with the loss of
our major workers at Stanford through the departure of Raj Reddy and
several of his students, we were left with no long-term workers in this
field, but with the facilities to do speech work and with some students
still working on their degrees. I became interested in the field and was
faced with the problem of how we could continue to do useful work in the
speech field with inadequate financial support and a very small group of
people. In surveying the field, it seemed to me that most of the workers
had rather assumed that work on the front end had reached the state of
diminishing returns and that all of their effort should be directed to
other aspects. With the danger of the pendulum swinging too far in this
direction, we decided to direct our efforts exclusively to the front
end.

Two of the more important contributions that we have made have been
reported separately at this conference and will not be described in
detail in this talk.

Let me begin by outlining some of the ways in which continued work on
the acoustic end can contribute toward the realization of a more
effective overall system.

The first problem is that of isolating those portions of the incoming
acoustic wave that warrant special attention. Speech is a highly
redundant process. Some of this redundancy is introduced simply because
of the biological limitations of our vocal tract, some because of
limitations of the ear as a transducer, but some of the redundancy
performs the very useful function of compensating for these very
limitations and of making it possible to transmit intelligence by speech
in the presence of background noise and of distortions in transmission.
Our problem is to separate these various aspects, to retain useful
redundancies, and to reduce the amount of information that is left for
processing at as early a stage as possible. As computer scientists, we
have at our command certain analytical tools that do not have direct
counterparts in the human speech recognition channel. We can therefore
safely ignore some aspects of the incoming wave. At the same time, the
very great computational speed at our disposal makes it possible for us
to retain a certain amount of redundancy to make the system robust. Much
of our work at Stanford has centered around this aspect of the problem.
We have concentrated our efforts on the problem of extracting as much
information as possible from the wave while still in the time domain,
and at the other extreme we have explored mechanisms for using
redundancies in the interest of robustness.

The work in the time domain has been adequately covered in other papers.
I will simply show two illustrations taken from these papers. The first
illustration shows the present performance of an acoustic segmenter,
designed to work in real time and to delineate those regions of the wave
form that can be thought of as being essentially steady state and those
displaying the maximum amount of transition. Both regions are thought to
be of value in aiding recognition.
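The talk does not spell out how the segmenter decides which regions are
steady state and which are transitional. As a minimal sketch of the idea,
not the Stanford segmenter itself, one might compare simple time-domain
features between neighbouring frames and call a frame transitional when
the features change sharply; the frame size, the choice of frame energy
and zero-crossing rate as features, and the change threshold are all
illustrative assumptions here:

```python
import numpy as np

def segment(wave, frame=160, threshold=0.2):
    """Label each frame of a speech wave "steady" or "transition".

    A time-domain sketch: per-frame energy and zero-crossing rate are
    compared between successive frames; a small relative change marks a
    steady-state region, a large change marks a transition.
    """
    n = len(wave) // frame
    feats = []
    for i in range(n):
        f = wave[i * frame:(i + 1) * frame]
        energy = float(np.sum(f * f))
        # count sign changes as a crude zero-crossing rate
        zcr = float(np.sum(np.abs(np.diff(np.sign(f)))) / 2)
        feats.append((energy, zcr))
    labels = ["steady"]  # the first frame has no predecessor to compare
    for (e0, z0), (e1, z1) in zip(feats, feats[1:]):
        de = abs(e1 - e0) / (max(e0, e1) + 1e-9)
        dz = abs(z1 - z0) / (max(z0, z1) + 1e-9)
        labels.append("transition" if max(de, dz) > threshold else "steady")
    return labels
```

A real-time version would apply the same comparison frame by frame as
samples arrive; the threshold trades missed transitions against spurious
ones and would have to be tuned on speech data.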

The second illustration has to do with the location of pitch marks,
marking the zero crossings which precede the maximum excursions of the
wave form. We have found that Fourier transforms or LPC transforms which
are based on a single period of the input wave are most revealing as to
the configuration of the vocal tract at the time, without the
complications introduced by glottal interaction. Undoubtedly glottal
interaction effects have a great deal to do with those speaker-specific
characteristics which allow us to identify the speaker quite
irrespective of what he is saying, but it is our belief that they have
little or nothing to do with understanding.
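The talk states only the rule, a pitch mark at the zero crossing
preceding each maximum excursion, and a transform taken over a single
period. A sketch under stated assumptions might read as follows; the
peak picking (a thresholded local maximum with a minimum spacing between
peaks) and the parameter names are my own, not the method of the talk:

```python
import numpy as np

def pitch_marks(wave, min_period=40, level=0.5):
    """Place a pitch mark at the negative-to-positive zero crossing
    preceding each major positive excursion of the wave form."""
    peak_level = level * float(np.max(wave))
    marks, last_peak = [], -min_period
    for k in range(1, len(wave) - 1):
        is_peak = (wave[k] >= peak_level
                   and wave[k] >= wave[k - 1] and wave[k] > wave[k + 1])
        if is_peak and k - last_peak >= min_period:
            j = k
            # step back to the zero crossing preceding the excursion
            while j > 0 and not (wave[j - 1] < 0 <= wave[j]):
                j -= 1
            marks.append(j)
            last_peak = k
    return marks

def single_period_spectrum(wave, marks):
    """Magnitude spectrum of exactly one pitch period, taken between
    two successive pitch marks."""
    a, b = marks[0], marks[1]
    return np.abs(np.fft.rfft(wave[a:b]))
```

Taking the transform between two successive marks means the analysis
window spans exactly one pitch period, which is the sense in which a
single-period spectrum can reflect the vocal-tract configuration with
little interference from the glottal cycle.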